First-Order Methods
Spoiler alert: Some methods are trash!
Gradient Descent
```julia
using Calculus
```

We can choose the direction of steepest descent as the descent direction d, which is the opposite of the gradient.
It can be shown (proof left to the reader as an exercise) that, with exact line search, each descent direction is orthogonal to the previous one.
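That orthogonality is easy to verify numerically. A minimal sketch (on an assumed toy quadratic f(x) = ½xᵀAx, not one of the notebook's test functions), using the closed-form exact line-search step for quadratics:

```julia
using LinearAlgebra

# Assumed toy quadratic: f(x) = ½xᵀAx with A = diag(2, 4).
A = Diagonal([2.0, 4.0])
∇f(x) = A * x

x = [1.0, 1.0]
g = ∇f(x)
d = -g                          # direction of steepest descent
α = (g ⋅ g) / (g ⋅ (A * g))     # exact line-search step (closed form for quadratics)
x′ = x + α * d
g′ = ∇f(x′)                     # the new gradient is orthogonal to d
```

The dot product g′ ⋅ d comes out at machine-precision zero, which is exactly the zig-zag pattern seen in the plots below.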
```julia
using Base.MathConstants

# fibonacci_search is assumed to be defined in an earlier cell of the notebook.

function bracket_minimum(f::Function, x=0.0; s=1e-12, k=2.0)
    a, ya = x, f(x)
    b, yb = x + s, f(x + s)
    if yb > ya                 # if we are not going downhill, reverse direction
        a, b = b, a
        ya, yb = yb, ya
        s = -s
    end
    while true
        c, yc = b + s, f(b + s)
        if ya > yb < yc        # definition of unimodal: we found a bracket
            return a < c ? (a, c) : (c, a)
        end
        a, ya, b, yb = b, yb, c, yc
        s *= k
    end
end
```
```julia
minimize(f::Function, a::Number, b::Number)::Number = fibonacci_search(f, a, b, 5)

function optimize_line_search(f, x, d)::Float64
    objective(α::Number)::Number = f(x + α*d)
    a, b = bracket_minimum(objective; k=1.05)
    α = minimize(objective, a, b)
    α
end

function gradient_descent_mixed(f, ∇f, x, k)::Array{Tuple, 1}
    points::Array{Tuple} = Array{Tuple}(undef, k+1)
    points[1] = tuple(x...)
    for i in 1:k
        d = -∇f(x)
        d = d / sqrt(sum(d .^ 2))    # normalize the descent direction
        α = optimize_line_search(f, x, d)
        x = x + α*d
        points[i+1] = tuple(x...)
    end
    points
end

begin
    f₁(x, y)::Number = (1-x)^2 + 5(y - x^2)^2
    f₁(x̄)::Number = f₁(x̄...)
end

using Plots
using PlutoUI
```
```julia
begin
    x₁ = range(-3, stop=3, length=200)
    y₁ = x₁
    color₁ = cgrad([:grey, :blue])
    ∇f₁(x̄) = Calculus.gradient(f₁, x̄)
    points₁ = [x for x in gradient_descent_mixed(f₁, ∇f₁, [x₁ᵢ, y₁ᵢ], n₁)]
    plot₁ = contour(x₁, y₁, f₁, c=color₁, levels=700 .^ (range(-.2, stop=1, length=14)))
    scatter!((1.0, 1.0), markersize=3, markerstrokewidth=0, c=:cyan, label="Optimal point")
    scatter!((x₁ᵢ, y₁ᵢ), markersize=2, c=:black, label="Initial point")
    plot!(points₁, c=:red, label="Descent")
    plot₁
end
```

The optimal point is near (0.9605721406277097, 0.9196323117617097)
The optimized value is 0.00160157399060238
```julia
abstract type DescentMethod end

struct GradientDescent <: DescentMethod
    α
end

init!(M::GradientDescent, f, ∇f, x) = M

function step!(M::GradientDescent, f, ∇f, x)::Array{Float64, 1}
    α, g = M.α, ∇f(x)
    return x - α*g
end

function optimize!(M::DescentMethod, f, ∇f, x; k=30)::Array{Tuple}
    init!(M, f, ∇f, x)
    points::Array{Tuple} = Array{Tuple}(undef, k+1)
    for i in 1:k
        points[i] = tuple(x...)
        x = step!(M, f, ∇f, x)
    end
    points[end] = tuple(x...)
    points
end

f₂ = f₁
```

The optimal point is near (0.7414365199553092, 0.6976297924702967) 👀
The optimized value is 0.17622960698092363 😒🤢
Gradient descent (alone) is not miraculous 😢
Conjugate Gradient
Gradient descent is 🗑️ in narrow valleys. The conjugate gradient method instead uses a local quadratic approximation of the function to pick better descent directions.
Conjugate gradient descent is GUARANTEED to optimize an n-dimensional quadratic function in n steps.
The method can also be applied to nonquadratic functions: smooth functions behave quadratically close to a local minimum (uh-oh).
This means that we need to get close to the minimum before we see the benefit. 🗑️
This method uses information from the previous gradient to improve the next direction, d′ = -g′ + βd, where β can be computed as:
Fletcher-Reeves: β = (g′ ⋅ g′) / (g ⋅ g)
Polak-Ribière: β = (g′ ⋅ (g′ - g)) / (g ⋅ g)
👀 For guaranteed convergence on the Polak-Ribiere method we need to reset β to 0 if it goes negative.
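The n-step guarantee can be checked on a small quadratic. A sketch (the 2-D quadratic ½xᵀAx - bᵀx and the helper name `cg_quadratic` are my own, not from the notebook), using the Fletcher-Reeves β and the closed-form exact line search for quadratics:

```julia
using LinearAlgebra

# cg_quadratic is a made-up helper name: conjugate gradient descent on
# the quadratic f(x) = ½xᵀAx - bᵀx, whose gradient is Ax - b.
function cg_quadratic(A, b; n = length(b))
    x = zeros(length(b))
    g = A * x - b
    d = -g
    for _ in 1:n
        α = (g ⋅ g) / (d ⋅ (A * d))   # exact line search (closed form)
        x += α * d
        g′ = A * x - b
        β = (g′ ⋅ g′) / (g ⋅ g)       # Fletcher-Reeves
        d = -g′ + β * d
        g = g′
    end
    return x
end

A = [4.0 1.0; 1.0 3.0]
b = [1.0, 2.0]
xmin = cg_quadratic(A, b)       # reaches the minimizer in n = 2 steps
```

After exactly n = 2 steps the residual Ax - b is at machine precision, matching the guarantee above.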
```julia
mutable struct ConjugateGradientDescent <: DescentMethod
    d::Array{Number, 1}
    g::Array{Number, 1}
end
ConjugateGradientDescent() = ConjugateGradientDescent(Number[], Number[])

function init!(M::ConjugateGradientDescent, f, ∇f, x)
    M.g = ∇f(x)
    M.d = -M.g
    return M
end

using LinearAlgebra

function step!(M::ConjugateGradientDescent, f, ∇f, x)::Array{Float64, 1}
    d, g = M.d, M.g
    g′ = ∇f(x)
    β = (g′ ⋅ (g′ - g)) / (g ⋅ g)   # Polak-Ribière
    β = β * (β > 0)                 # reset β to 0 if it goes negative
    d′ = -g′ + β*d
    α = optimize_line_search(f, x, d′)
    x′ = x + α*d′
    M.d, M.g = d′, g′
    return x′
end

f₃ = f₁
```

The optimal point is near (0.9926081864948202, 0.9847483130588324) 🌟
The optimized value is 5.6004977270041056e-5 🥳
Momentum
The idea is to simulate momentum: a force updates a velocity term that accumulates past gradients, so we avoid slow movement across nearly flat surfaces (where gradients with small magnitude make plain gradient descent fail miserably).
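A minimal 1-D sketch of the idea (the flat-slope toy gradient and the helper name `momentum_vs_gd` are assumptions, not part of the notebook): with a tiny constant gradient, momentum accumulates velocity and travels much farther than plain gradient descent in the same number of steps.

```julia
# Assumed flat-slope toy problem; momentum_vs_gd is a made-up helper name.
function momentum_vs_gd(steps)
    ∇f(x) = 1e-3                      # nearly flat surface: tiny constant gradient
    α, β = 0.1, 0.9
    x_gd = 0.0
    x_mom, v = 0.0, 0.0
    for _ in 1:steps
        x_gd -= α * ∇f(x_gd)          # plain gradient descent crawls
        v = β * v - α * ∇f(x_mom)     # momentum: the velocity accumulates
        x_mom += v
    end
    return x_gd, x_mom
end

x_gd, x_mom = momentum_vs_gd(100)     # momentum travels roughly 9× farther here
```

With β = 0.9 the terminal velocity is 1/(1-β) = 10 times the plain gradient step, which is why the momentum point pulls so far ahead.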
```julia
mutable struct Momentum <: DescentMethod
    α::Number
    β::Number
    v::Array{Number, 1}
end
Momentum(α, β) = Momentum(α, β, Number[])

function init!(M::Momentum, f, ∇f, x)
    M.v = zeros(length(x))
end

function step!(M::Momentum, f, ∇f, x)::Array{Float64, 1}
    α, β, v, g = M.α, M.β, M.v, ∇f(x)
    v[:] = β*v - α*g    # update velocity
    return x + v
end

begin
    f₄(x, y)::Number = (1-x)^2 + 100(y - x^2)^2
    f₄(x̄)::Number = f₄(x̄...)
end

begin
    x₄ = range(-1.5, stop=1.25, length=200)
    y₄ = range(-.45, stop=1.5, length=200)
    color₄ = cgrad([:yellow, :green, :blue, :red, :purple])
    ∇f₄(x̄) = Calculus.gradient(f₄, x̄)
    points₄ = [x for x in optimize!(Momentum(.005, 1), f₄, ∇f₄, [x₄ᵢ, y₄ᵢ]; k=n₄)]
    contour(x₄, y₄, f₄, c=color₄, levels=2500 .^ (range(-.5, stop=1, length=25)), legend=false)
    scatter!((1.0, 1.0), markersize=3, markerstrokewidth=0, c=:cyan, label="Optimal point")
    scatter!((x₄ᵢ, y₄ᵢ), markersize=4, c=:black, label="Initial point")
    plot!(points₄, c=:red, label="Descent")
    scatter!(points₄, c=:green, markersize=2.5, markerstrokewidth=0, label="Path")
end
```

The optimal point is near (1.0263932677227146, 1.304148459285668) 🤷♀️
The optimized value is 6.28400683251055 🤦♂️
It's like the point is falling straight to the minimum 😲
But... it might go too fast
Nesterov Momentum
If it goes too fast, try predicting where the velocity will carry the point at the next iteration, and evaluate the gradient there...
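A sketch of a single Nesterov step (the 1-D quadratic and all the values are assumed, not from the notebook): the gradient is evaluated at the lookahead point x + βv rather than at x.

```julia
# Assumed 1-D example: f(x) = x², ∇f(x) = 2x, with made-up α, β, and state.
∇f(x) = 2x
α, β = 0.1, 0.9
x, v = 1.0, -0.5
lookahead = x + β * v           # predict where the velocity carries the point
v = β * v - α * ∇f(lookahead)   # gradient evaluated at the lookahead point
x += v
```

If the lookahead point has overshot the minimum, the gradient there already points back, so the velocity gets corrected one step earlier than with plain momentum.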
```julia
mutable struct NesterovMomentum <: DescentMethod
    α::Number
    β::Number
    v::Array{Number, 1}
end
NesterovMomentum(α::Number, β::Number) = NesterovMomentum(α, β, Number[])

function init!(M::NesterovMomentum, f, ∇f, x)
    M.v = zeros(length(x))
    return M
end

function step!(M::NesterovMomentum, f, ∇f, x)::Array{Float64, 1}
    α, β, v = M.α, M.β, M.v
    next_g = β*v                        # predicted displacement
    v[:] = next_g - α*∇f(x + next_g)    # gradient at the lookahead point
    return x + v
end

f₅ = f₄

@bind n₅ Slider(1:1000, default=42, show_value=true)
@bind x₅ᵢ Slider(-1.5:.01:1.5, default=-0.77, show_value=true)
@bind y₅ᵢ Slider(-1.5:.01:1.5, default=1.42, show_value=true)
```
```julia
begin
    x₅ = range(-1.5, stop=1.25, length=200)
    y₅ = range(-.45, stop=1.5, length=200)
    color₅ = cgrad([:yellow, :green, :blue, :red, :purple])
    ∇f₅(x̄) = Calculus.gradient(f₅, x̄)
    points₅ = [x for x in optimize!(NesterovMomentum(.0009, .95), f₅, ∇f₅, [x₅ᵢ, y₅ᵢ]; k=n₅)]
    contour(x₅, y₅, f₅, c=color₅, levels=2500 .^ (range(-.5, stop=1, length=25)), legend=false)
    scatter!((1.0, 1.0), markersize=3, markerstrokewidth=0, c=:cyan, label="Optimal point")
    scatter!((x₅ᵢ, y₅ᵢ), markersize=4, c=:black, label="Initial point")
    plot!(points₅, c=:red, label="Descent")
    scatter!(points₅, c=:green, markersize=1.5, markerstrokewidth=0, label="Path")
end
```

The optimal point is near (0.9999678362763855, 0.9999355449078215) 🤔
The optimized value is 1.03616095695321e-9 🤔🤔
It's still going too fast...
Adagrad
Adaptive subgradient method
This method updates the learning rate for each component individually.
ϵ is a small value (e.g. 1e-8) to prevent division by zero.
α is usually set to 0.01; since the per-component rates adapt on their own, its exact value doesn't matter too much.
Problem: As we can see, the components of s only grow over time, making the learning rate decrease (and become infinitesimally small, possibly even before convergence).
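A sketch of that problem (the constant unit gradient and the helper name `adagrad_steps` are made up): the accumulated s only grows, so the effective step α/(√s + ϵ) shrinks toward zero.

```julia
# Assumed constant unit gradient; adagrad_steps is a made-up helper name.
function adagrad_steps(n; α = 0.01, ϵ = 1e-8)
    s, g = 0.0, 1.0
    steps = Float64[]
    for _ in 1:n
        s += g^2                             # s only ever grows
        push!(steps, α * g / (sqrt(s) + ϵ))  # so the effective step shrinks
    end
    return steps
end

steps = adagrad_steps(10_000)
```

After 10,000 iterations the effective step has shrunk by a factor of √10000 = 100, even though the gradient never changed.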
```julia
mutable struct Adagrad <: DescentMethod
    α::Number
    ϵ::Number
    s::Array{Float64, 1}
end
Adagrad(α::Number, ϵ::Number) = Adagrad(α, ϵ, Float64[])
Adagrad(α::Number) = Adagrad(α, 1e-8)

function init!(M::Adagrad, f, ∇f, x)
    M.s = zeros(length(x))
    return M
end

function step!(M::Adagrad, f, ∇f, x)::Array{Float64, 1}
    α, ϵ, s, g = M.α, M.ϵ, M.s, ∇f(x)
    s[:] += g .* g
    return x - α*g ./ (sqrt.(s) .+ ϵ)
end

f₆ = f₄
```

The optimal point is near (-0.3043008040570129, 0.09737789529620958) 🤣
The optimized value is 1.7034843912261084 🐌🐌🐌
Really slow convergence...
RMSProp
It comes to the rescue of Adagrad, solving the problem of the ever-decreasing learning rate by keeping an exponentially decaying average of the squared gradient: ŝ ← γŝ + (1-γ)(g ⊙ g).
⊙ = \odot (element-wise multiplication)
γ is typically close to 0.9
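A quick sketch of why this fixes Adagrad (again an assumed constant unit gradient; `rmsprop_average` is a made-up helper name): the decaying average ŝ converges to g² instead of growing without bound, so the step size levels off rather than vanishing.

```julia
# Assumed constant unit gradient; rmsprop_average is a made-up helper name.
function rmsprop_average(n; γ = 0.9)
    ŝ, g = 0.0, 1.0
    for _ in 1:n
        ŝ = γ * ŝ + (1 - γ) * g^2   # decaying average, bounded by g²
    end
    return ŝ
end

ŝ = rmsprop_average(1000)           # converges to g² = 1 instead of growing
```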
```julia
mutable struct RMSProp <: DescentMethod
    α::Number
    γ::Number
    ϵ::Number
    ŝ::Array{Float64, 1}
end
RMSProp(α::Number, γ::Number, ϵ::Number) = RMSProp(α, γ, ϵ, Float64[])
RMSProp(α::Number, γ::Number) = RMSProp(α, γ, 1e-8)
RMSProp(α::Number) = RMSProp(α, .9)

function init!(M::RMSProp, f, ∇f, x)
    M.ŝ = zeros(length(x))
    return M
end

function step!(M::RMSProp, f, ∇f, x)
    α, γ, ϵ, ŝ, g = M.α, M.γ, M.ϵ, M.ŝ, ∇f(x)
    ŝ[:] = γ*ŝ + (1-γ)*(g .* g)
    return x - α*g ./ (sqrt.(ŝ) .+ ϵ)
end

f₇ = f₄
```
```julia
begin
    x₇ = range(-1.5, stop=1.5, length=200)
    y₇ = range(-.45, stop=1.5, length=200)
    color₇ = cgrad([:yellow, :green, :blue, :red, :purple])
    ∇f₇(x̄) = Calculus.gradient(f₇, x̄)
    points₇ = [x for x in optimize!(RMSProp(.04, .999, 1), f₇, ∇f₇, [x₇ᵢ, y₇ᵢ]; k=n₇)]
    contour(x₇, y₇, f₇, c=color₇, levels=2500 .^ (range(-.5, stop=1, length=25)), legend=false)
    scatter!((1.0, 1.0), markersize=3, markerstrokewidth=0, c=:cyan, label="Optimal point")
    scatter!((x₇ᵢ, y₇ᵢ), markersize=4, c=:black, label="Initial point")
    plot!(points₇, c=:red, label="Descent")
    scatter!(points₇, c=:green, markersize=1, markerstrokewidth=0, label="Path")
end
```

Adadelta
Removes the learning rate parameter entirely 😲😲
```julia
mutable struct Adadelta <: DescentMethod
    γs::Number
    γx::Number
    ϵ::Number
    ŝ::Array{Float64, 1}
    u::Array{Float64, 1}
end
Adadelta(γs::Number, γx::Number, ϵ::Number) = Adadelta(γs, γx, ϵ, Float64[], Float64[])
Adadelta(γs::Number, γx::Number) = Adadelta(γs, γx, 1e-2)

function init!(M::Adadelta, f, ∇f, x)
    M.ŝ = zeros(length(x))
    M.u = zeros(length(x))
    return M
end

function step!(M::Adadelta, f, ∇f, x)::Array{Float64, 1}
    γs, γx, ϵ, ŝ, u, g = M.γs, M.γx, M.ϵ, M.ŝ, M.u, ∇f(x)
    ŝ[:] = γs*ŝ + (1-γs)*(g .* g)
    Δx = -(sqrt.(u) .+ ϵ) ./ (sqrt.(ŝ) .+ ϵ) .* g
    u[:] = γx*u + (1-γx)*(Δx .* Δx)
    return x + Δx
end

f₈ = f₄
```
```julia
@bind n₈ Slider(1:1000, default=102, show_value=true)

begin
    x₈ = range(-1.5, stop=1.5, length=200)
    y₈ = range(-.45, stop=1.5, length=200)
    color₈ = cgrad([:yellow, :green, :blue, :red, :purple])
    ∇f₈(x̄) = Calculus.gradient(f₈, x̄)
    points₈ = [x for x in optimize!(Adadelta(.9, .9999, 3e-2), f₈, ∇f₈, [x₈ᵢ, y₈ᵢ]; k=n₈)]
    contour(x₈, y₈, f₈, c=color₈, levels=2500 .^ (range(-.5, stop=1, length=25)), legend=false)
    scatter!((1.0, 1.0), markersize=3, markerstrokewidth=0, c=:cyan, label="Optimal point")
    scatter!((x₈ᵢ, y₈ᵢ), markersize=4, c=:black, label="Initial point")
    plot!(points₈, c=:red, label="Descent")
    scatter!(points₈, c=:green, markersize=1, markerstrokewidth=0, label="Path")
end
```

Adam
Adaptive moment estimation method
It stores an exponentially decaying squared gradient (like RMSProp and Adadelta), but also an exponentially decaying gradient (like momentum).
Initializing the gradient and squared gradient to zero introduces a bias, which is corrected at each step. (Good defaults are α=.001, γᵥ=.9, γₛ=.999 and ϵ=1e-8)
The equations for Adam are:
Biased decaying momentum: v ← γᵥ v + (1-γᵥ) g
Biased decaying sq. gradient: s ← γₛ s + (1-γₛ)(g ⊙ g)
Corrected decaying momentum: v̂ = v / (1 - γᵥ^k)
Corrected decaying sq. gradient: ŝ = s / (1 - γₛ^k)
Next iterate: x ← x - α v̂ ./ (√ŝ + ϵ)
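A tiny numeric sketch of the bias correction (assumed values, first step only): with v initialized to zero, the biased estimate after one step is only (1-γᵥ)g, and dividing by 1-γᵥ^k recovers the gradient's scale.

```julia
# Assumed values for a first Adam step (k = 1), with v initialized to zero.
γᵥ, g, k = 0.9, 2.0, 1
v = γᵥ * 0.0 + (1 - γᵥ) * g     # biased estimate: only 0.2, far from g
v̂ = v / (1 - γᵥ^k)              # bias correction recovers the scale of g
```

As k grows, 1 - γᵥ^k approaches 1 and the correction fades away, so it only matters during the first iterations.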
```julia
mutable struct Adam <: DescentMethod
    α::Float64
    γᵥ::Float64
    γₛ::Float64
    ϵ::Float64
    k::Integer
    v::Array{Float64, 1}
    s::Array{Float64, 1}
end
Adam(α::Number, γᵥ::Number, γₛ::Number, ϵ::Number) = Adam(α, γᵥ, γₛ, ϵ, 0, Float64[], Float64[])
Adam(α::Number, γᵥ::Number, γₛ::Number) = Adam(α, γᵥ, γₛ, 1e-8)
Adam(α::Number, γᵥ::Number) = Adam(α, γᵥ, .999)
Adam(α::Number) = Adam(α, .9)

function init!(M::Adam, f, ∇f, x)
    M.k = 0
    M.v = zeros(length(x))
    M.s = zeros(length(x))
    return M
end

function step!(M::Adam, f, ∇f, x)
    α, γᵥ, γₛ, ϵ, k = M.α, M.γᵥ, M.γₛ, M.ϵ, M.k
    s, v, g = M.s, M.v, ∇f(x)
    v[:] = γᵥ*v + (1-γᵥ)*g
    s[:] = γₛ*s + (1-γₛ)*(g .* g)
    M.k = k += 1    # updates k at the same time
    v̂ = v/(1-γᵥ^k)
    ŝ = s/(1-γₛ^k)
    return x - (α*v̂)./(sqrt.(ŝ) .+ ϵ)
end

f₉ = f₄
```
```julia
@bind n₉ Slider(1:500, default=222, show_value=true)

begin
    x₉ = range(-1.5, stop=1.5, length=200)
    y₉ = range(-.45, stop=1.5, length=200)
    color₉ = cgrad([:yellow, :green, :blue, :red, :purple])
    ∇f₉(x̄) = Calculus.gradient(f₉, x̄)
    points₉ = [x for x in optimize!(Adam(.3, .95, .99), f₉, ∇f₉, [x₉ᵢ, y₉ᵢ]; k=n₉)]
    contour(x₉, y₉, f₉, c=color₉, levels=2500 .^ (range(-.5, stop=1, length=25)), legend=false)
    scatter!((1.0, 1.0), markersize=3, markerstrokewidth=0, c=:cyan, label="Optimal point")
    scatter!((x₉ᵢ, y₉ᵢ), markersize=4, c=:black, label="Initial point")
    plot!(points₉, c=:red, label="Descent")
    scatter!(points₉, c=:green, markersize=1.5, markerstrokewidth=0, label="Path")
end
```

Hypergradient Descent
These methods are all quite sensitive to the learning rate... let's optimize it first.
We apply gradient descent to the learning rate itself 🤯🤯
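A sketch of the hypergradient update rule with assumed values: when the current gradient is aligned with the previous one, α grows; when they oppose each other (a sign of overshooting), α shrinks.

```julia
using LinearAlgebra

# Assumed values; this only illustrates the sign of the α update.
μ, α = 1e-4, 0.01
g_prev = [1.0, 2.0]
g_same = [0.5, 1.0]                 # aligned with the previous gradient
g_opp  = [-0.5, -1.0]               # opposing it (overshoot)
α_up   = α + μ * (g_same ⋅ g_prev)  # learning rate grows
α_down = α + μ * (g_opp ⋅ g_prev)   # learning rate shrinks
```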
```julia
mutable struct HyperGradientDescent <: DescentMethod
    α₀::Float64               # initial learning rate
    μ::Float64                # "inception" learning rate (the learning rate's learning rate)
    α::Float64                # current learning rate
    g_prev::Array{Float64, 1}
end
HyperGradientDescent(α₀::Number, μ::Number) = HyperGradientDescent(α₀, μ, α₀, Float64[])

function init!(M::HyperGradientDescent, f, ∇f, x)
    M.α = M.α₀
    M.g_prev = zeros(length(x))
    return M
end

function step!(M::HyperGradientDescent, f, ∇f, x)::Array{Float64, 1}
    α, μ, g, g_prev = M.α, M.μ, ∇f(x), M.g_prev
    α = α + μ*(g ⋅ g_prev)
    M.g_prev, M.α = g, α
    return x - α*g
end

f₁₀ = f₄

begin
    x₁₀ = range(-1.5, stop=1.5, length=200)
    y₁₀ = range(-.45, stop=1.5, length=200)
    color₁₀ = cgrad([:yellow, :green, :blue, :red, :purple])
    ∇f₁₀(x̄) = Calculus.gradient(f₁₀, x̄)
    points₁₀ = [x for x in optimize!(HyperGradientDescent(.005, 1e-8), f₁₀, ∇f₁₀, [x₁₀ᵢ, y₁₀ᵢ]; k=n₁₀)]
    contour(x₁₀, y₁₀, f₁₀, c=color₁₀, levels=2500 .^ (range(-.5, stop=1, length=25)), legend=false)
    scatter!((1.0, 1.0), markersize=3, markerstrokewidth=0, c=:cyan, label="Optimal point")
    scatter!((x₁₀ᵢ, y₁₀ᵢ), markersize=4, c=:black, label="Initial point")
    plot!(points₁₀, c=:red, label="Descent")
    scatter!(points₁₀, c=:green, markersize=1.5, markerstrokewidth=0, label="Path")
end
```

Let's do the same to Nesterov momentum
```julia
mutable struct HyperNesterovMomentum <: DescentMethod
    α₀::Float64               # initial learning rate
    μ::Float64                # "inception" learning rate
    β::Float64
    v::Array{Float64, 1}
    α::Float64
    g_prev::Array{Float64, 1}
end
HyperNesterovMomentum(α₀::Number, μ::Number, β::Number) = HyperNesterovMomentum(α₀, μ, β, Float64[], α₀, Float64[])

function init!(M::HyperNesterovMomentum, f, ∇f, x)
    M.α = M.α₀
    M.v = zeros(length(x))
    M.g_prev = zeros(length(x))
    return M
end

function step!(M::HyperNesterovMomentum, f, ∇f, x)::Array{Float64, 1}
    α, β, v = M.α, M.β, M.v
    g, μ, g_prev = ∇f(x), M.μ, M.g_prev
    α = α + μ*(g ⋅ (g_prev + β*v))    # sign fixed so aligned gradients increase α
    v[:] = β*v + g
    M.α, M.g_prev = α, g    # store the current gradient (not the stale one)
    return x - α*(g + β*v)
end

f₁₁ = f₄

begin
    x₁₁ = range(-1.5, stop=1.5, length=200)
    y₁₁ = range(-.45, stop=1.5, length=200)
    color₁₁ = cgrad([:yellow, :green, :blue, :red, :purple])
    ∇f₁₁(x̄) = Calculus.gradient(f₁₁, x̄)
    points₁₁ = [x for x in optimize!(HyperNesterovMomentum(.5e-3, 3e-9, .97), f₁₁, ∇f₁₁, [x₁₁ᵢ, y₁₁ᵢ]; k=n₁₁)]
    contour(x₁₁, y₁₁, f₁₁, c=color₁₁, levels=2500 .^ (range(-.5, stop=1, length=25)), legend=false)
    scatter!((1.0, 1.0), markersize=3, markerstrokewidth=0, c=:cyan, label="Optimal point")
    scatter!((x₁₁ᵢ, y₁₁ᵢ), markersize=4, c=:black, label="Initial point")
    plot!(points₁₁, c=:red, label="Descent")
    scatter!(points₁₁, c=:green, markersize=1.5, markerstrokewidth=0, label="Path")
end
```
```julia
function plot_method!(M::DescentMethod, f, ∇f, x, k; curve=:red, path=:green)
    methodname = string(nameof(typeof(M)))
    points = [x for x in optimize!(M, f, ∇f, x; k=k)]
    plot!(points, c=curve, label="$(methodname)")
    scatter!(points, c=path, markersize=1, markerstrokewidth=0, label=nothing)
end
```

Exercises
Exercise 5.1. Compute the gradient of
Answer: By decomposing the quadratic form into a (really long) sum, differentiating term by term, and then rewriting the result in matrix form we get:
Exercise 5.2. Apply gradient descent with unit step to f(x)=x⁴, compute two iterations.
```julia
begin
    fₑ(x::Float64)::Array{Float64, 1} = [x^4]
    fₑ(x̄::Array{Float64, 1})::Array{Float64, 1} = fₑ(x̄[1])
    ∇fₑ(x::Float64)::Array{Float64, 1} = [4*x^3]
    ∇fₑ(x̄::Array{Float64, 1})::Array{Float64, 1} = ∇fₑ(x̄[1])
end
```
```julia
begin
    xₑ = range(-.01, .5, length=50)
    plot(xₑ, (x) -> fₑ(x)[1], label="Function")
    pointsₑ = [x[1] for x in optimize!(GradientDescent(1), fₑ, ∇fₑ, [.4]; k=2)]
    plot!(pointsₑ, (x) -> fₑ(x)[1], c=:red, label="Gradient descent")
    scatter!(pointsₑ, (x) -> fₑ(x)[1], c=:black, markersize=1, markerstrokewidth=0, label=nothing)
end
```

Exercise 5.3. Apply one step of gradient descent to f(x)=e^x+e^-x from x=10 with both a unit step and with exact line search.
```julia
begin
    fₘ(x::Float64)::Array{Float64, 1} = [exp(x) + exp(-x)]
    fₘ(x̄::Array{Float64, 1})::Array{Float64, 1} = fₘ(x̄[1])
    ∇fₘ(x::Float64)::Array{Float64, 1} = [exp(x) - exp(-x)]   # analytic derivative
    ∇fₘ(x̄::Array{Float64, 1})::Array{Float64, 1} = ∇fₘ(x̄[1])
end
```

Answer: If someone can do this without the gradient exploding to infinity really quickly, please help. (With a unit step the iterates do blow up; exact line search avoids this.)
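A possible sketch for the exercise (the variable names `f_ex`, `x_unit`, `x_exact` are my own): with a unit step the first iterate overshoots to about -22016, while exact line search exploits the fact that the minimizer x = 0 lies on the search line, so the optimal step lands exactly there.

```julia
# Assumed helper names; f(x) = eˣ + e⁻ˣ has its minimum at x = 0.
f_ex(x) = exp(x) + exp(-x)
∇f_ex(x) = exp(x) - exp(-x)

x₀ = 10.0
g = ∇f_ex(x₀)
x_unit = x₀ - 1.0 * g      # unit step: overshoots to about -22016
α = x₀ / g                 # exact line search: the minimizer 0 lies on the line
x_exact = x₀ - α * g       # lands (numerically) at 0
```

So the unit step diverges wildly (and the next gradient at x_unit overflows), while the exact-line-search step reaches the minimum in one iteration.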
Exercise 5.4. The conjugate gradient method can be used to find a search direction d when a local quadratic model of a function is available at the current point. With d as search direction, let the model be
Answer: We know that:
When d = 0 the gradient of the model will also be 0, and if this happens we end up dividing by infinitesimally small values when computing the next search direction. (This happened to me when using too many iterations.)
Exercise 5.5. How is Nesterov momentum an improvement over momentum?
Answer: The problem with momentum is that it doesn't slow down. Nesterov momentum evaluates the gradient at the predicted next position, so if it overshoots, that gradient pushes it back, correcting the step.
Exercise 5.6. In what way is the conjugate gradient method an improvement over steepest descent?
Answer: Gradient descent is really slow in narrow valleys, because each direction is orthogonal to the previous one, causing a zig-zag movement. The conjugate gradient method corrects this by letting the previous direction contribute to the next one, making the path to the minimum much straighter.
Exercise 5.7. In conjugate gradient descent, what is the normalized descent direction at the first iteration for the function f(x, y) = x² + xy + y² + 5, starting from (1, 1)?
```julia
begin
    fₚ(x, y) = x^2 + x*y + y^2 + 5
    fₚ(x̄) = fₚ(x̄...)
    ∇fₚ(x̄) = Calculus.gradient(fₚ, x̄)
end

begin
    plotly()
    xₚ = range(-3, 3, length=40)
    yₚ = range(-3, 3, length=40)
    pₚ = contour(xₚ, yₚ, fₚ)
    plot_method!(ConjugateGradientDescent(), fₚ, ∇fₚ, [1, 1], 2)
    gr()
    pₚ
end

begin
    Mₚ = ConjugateGradientDescent()
    init!(Mₚ, fₚ, ∇fₚ, [1, 1])
    xₚ₀ = [1, 1]
    xₚ₁ = step!(Mₚ, fₚ, ∇fₚ, xₚ₀)
    r₁, r₂ = Mₚ.d[1], Mₚ.d[2]
    xₚ₂ = step!(Mₚ, fₚ, ∇fₚ, xₚ₁)
end;
```

Answer: The first descent direction is [-2.9999999999239875, -2.9999999999239875] (almost identical in each component; normalized, this is [-1/√2, -1/√2]), and the point is [1.0056882026629292e-6, 1.0056882026629292e-6] after two iterations. This is expected, because the function is a polynomial of order two and conjugate gradient descent should optimize it in two steps.
Exercise 5.8. We have a polynomial function f such that f(x) > 2 for all x in three-dimensional Euclidean space. Suppose we are using steepest descent with step lengths optimized at each step, and we want to find a local minimum of f. If our unnormalized descent direction is [1,2,3] at step k, is it possible for our unnormalized descent direction at step k+1 to be [0,0,-3]? Why or why not?
Answer: We know that with exact line search, each steepest descent direction is orthogonal to the previous one. Since [1,2,3] ⋅ [0,0,-3] = -9 ≠ 0, the two directions are not orthogonal, so that descent direction at step k+1 is not possible.
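The check itself is a one-line dot product (the vector names are my own):

```julia
using LinearAlgebra

d_k  = [1.0, 2.0, 3.0]      # unnormalized descent direction at step k
d_k1 = [0.0, 0.0, -3.0]     # candidate direction at step k + 1
d_k ⋅ d_k1                  # nonzero, so the directions are not orthogonal
```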